IDAPython/Firmware matching
This is a work in progress; for now, I'm just writing this down to order my thoughts.

Since IDAPython was extremely slow when iterating through all the firmware, I've rewritten the firmware matching scripts from scratch. They do not depend on IDAPython any more; instead, they use the IDC files.

State of the art: gensig/finsig

Arm.Indy did a good job by experimenting a lot with Gensig_finsig from CHDK, with some fine results. However, those two utilities have some limitations. Quote from AI (Arm.Indy):

"with gensig/finsig, we can easily find addresses of known functions for a new firmware. but we can not find data or structures addresses"

So I'm trying to make a better version of gensig/finsig in IDAPython, which also finds structures and other kinds of data.

First of all... how does gensig/finsig work?

It creates signatures of functions known in some versions of the firmware. Then it uses those signatures to match the same functions in other versions (usually newer, or from another camera model).

A signature seems to be built from these ARM instructions: add, and, b, bl, cmp, ldr, mov, mul, rsb, rsc, str, sub, tst (see https://tools.assembla.com/chdk/browser/trunk/tools/gensig.c ). The others are commented out (I don't know why; are these ones more relevant?).

How is the matching process done? With some C code ( https://tools.assembla.com/chdk/browser/trunk/tools/finsig.c ) which I don't quite understand...

Matching subs

So, as I don't fully understand how gensig/finsig work, I've tried to create my own routine for matching.

How to identify similar functions in two different firmwares? Some rough hypotheses:
* In many cases, their code is identical, but uses different addresses.
* Some routines are pretty different, but they can be identified from the strings they reference.
* If we identified two matching functions, A and B, with the same code structure, we may assume that the functions called by A will match the ones called by B.
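The first hypothesis (identical code, different addresses) can be illustrated with a minimal sketch. This is not how gensig/finsig or match.py actually work; it only assumes that each function body is available as a list of disassembly text lines, and the names `signature` and `similarity` are made up for this example. The signature keeps the mnemonics and drops the operands, so two functions that differ only in addresses get the same signature.

```python
import difflib

def signature(disasm_lines):
    """Return the sequence of mnemonics of a function body, operands dropped.

    disasm_lines is assumed to be a list of strings such as
    "LDR R0, [R4, #0x12]" (a hypothetical input format).
    """
    sig = []
    for line in disasm_lines:
        parts = line.split()
        if parts:                      # skip empty lines
            sig.append(parts[0].lower())
    return tuple(sig)

def similarity(sig_a, sig_b):
    """Score two signatures between 0.0 and 1.0 (1.0 = same sequence)."""
    return difflib.SequenceMatcher(None, sig_a, sig_b).ratio()

# Two made-up function bodies: same structure, different addresses.
func_a = ["LDR R4, =0x1234", "LDR R0, [R4, #0x12]", "BL 0xFF810000", "MOV R0, #0"]
func_b = ["LDR R4, =0xABCD", "LDR R0, [R4, #0x12]", "BL 0xFF820040", "MOV R0, #0"]
func_c = ["LDR R4, =0x5678", "BL 0xFF830000"]

score_same = similarity(signature(func_a), signature(func_b))
score_diff = similarity(signature(func_a), signature(func_c))
```

A score below 1.0 does not rule out a match; small functions in particular may score high against many unrelated ones, so such a score is only a first filter.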
So we can:
* identify functions by code structure (this seems similar to what gensig/finsig does). This is done by creating signatures from the function bodies and comparing the signatures somehow (how exactly is still an open question).
* match by strings. An example is LV_Initialize. In this case, only the first string matches. This is also one of the methods for naming subroutines: http://magiclantern.wikia.com/wiki/2.0.4_IDA_%27Discovery%27_Script:_Anyone_fancy_writing
* check if the functions called by two previously-matched subs with identical code structure were matched. If a mismatch is found => we might have detected a false positive.
* if no mismatch is found, then we have a new (possible) match!
* other methods for validating the matches?

Also:
* some functions are small and similar in structure to lots of others (but different in data). How to match them properly?

Matching structs

Many struct accesses look like this:

    LDR R4, =0x1234
    LDR R0, [R4, #0x12]

So, a method for identifying them is to look for addressing modes like [Rx, #off] and trace backwards in order to find the value of Rx.

With this, you can say: function X references structs 0x1234, 0x12345 and 0x123456.

If you previously matched function X from firmware A with func. Y from fw. B, then all you have to do is look at which structs are referenced by func. Y, let's say 0xABCD, 0xABCDE and 0xABCDEF. Then the match is straightforward: 0x1234 in fw. A <------> 0xABCD in fw. B, and so on...

Validating the results

That's difficult. If the results from all the techniques above are consistent, that's good. In practice, they are not.

Test A

Notation: A.X = func. X from firmware A.

Suppose we found A.X and B.Y to be the same:

    A.X:            B.Y:
      a               a
      b               b
      call A.QWE      call B.RTY
      c               c
      call A.UIO      call B.ASD
      ...             ...

In an ideal case, A.QWE should match B.RTY and A.UIO should match B.ASD. However, they might be different functions which essentially do the same thing (in this case, they won't match at all).
But if A.QWE matches with some unrelated B.XYZ, for example, then something might be wrong here.

A quick sanity check:
* either A.QWE matches B.RTY, or neither of them is matched with any other function
* the same for A.UIO and B.ASD

If the sanity check passes for all the subs called from A.X and B.Y, then we may say A.QWE <--> B.RTY and A.UIO <--> B.ASD. But those have to be checked, too.

Test B

Another hint can be given by the following simple tests:
* How many times is A.X called?
* How many times is B.Y called?

For example, DebugMsg and assert are the most called functions in the firmware.

Difficulty: many functions are called indirectly, by first storing their address somewhere and then loading it into PC. These are not easy to find...

Test C

Check the strings used by the two functions. Is this relevant?

Test D

Group functions by the structures they reference.

Using the scripts

See GPL_Tools/match.py.
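The sanity check from Test A can be sketched as follows. This is only an illustration under stated assumptions, not the code from match.py: the function name `check_callees` is hypothetical, the callees of the two matched functions are assumed to be given as lists in call order, and the existing matches are assumed to be a one-to-one dict from firmware A names to firmware B names.

```python
def check_callees(callees_a, callees_b, matches):
    """Compare the callees of two matched functions, pair by pair.

    callees_a, callees_b: callee names of A.X and B.Y, in call order
                          (assumed to have the same length).
    matches: dict mapping already-matched fw A names to fw B names.
    Returns (candidates, conflicts), both lists of (a_name, b_name).
    """
    reverse = {b: a for a, b in matches.items()}
    candidates, conflicts = [], []
    for fa, fb in zip(callees_a, callees_b):
        a_partner = matches.get(fa)    # what fa is currently matched with
        b_partner = reverse.get(fb)    # what fb is currently matched with
        if a_partner == fb:
            continue                   # already matched with each other
        if a_partner is None and b_partner is None:
            candidates.append((fa, fb))   # neither is matched elsewhere
        else:
            conflicts.append((fa, fb))    # one of them is matched elsewhere
    return candidates, conflicts

# Ideal case: nothing is matched yet except A.X <--> B.Y.
ok_candidates, ok_conflicts = check_callees(
    ["A.QWE", "A.UIO"], ["B.RTY", "B.ASD"], {"A.X": "B.Y"})

# Suspicious case: A.QWE was already matched with an unrelated B.XYZ.
bad_candidates, bad_conflicts = check_callees(
    ["A.QWE", "A.UIO"], ["B.RTY", "B.ASD"], {"A.X": "B.Y", "A.QWE": "B.XYZ"})
```

Following the text above, the candidates would only be accepted as new (possible) matches when the conflicts list is empty, and they still need to be checked themselves.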